
    A Dynamic I/O-Efficient Structure for One-Dimensional Top-k Range Reporting

    We present a structure in external memory for "top-k range reporting", which uses linear space, answers a query in O(lg_B n + k/B) I/Os, and supports an update in O(lg_B n) amortized I/Os, where n is the input size and B is the block size. This improves the previous state of the art, which incurs O(lg^2_B n) amortized I/Os per update. Comment: In PODS'1
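    The abstract specifies only the interface of the structure. As a point of reference, the brute-force in-memory sketch below (a hypothetical baseline, not the paper's external-memory structure) spells out one common formulation of the problem: each element has a 1-D key and a weight, and a query asks for the k largest weights among the keys in a range. Details may differ from the paper's exact definition, and none of the I/O bounds above are realized here.

        # Brute-force baseline for one-dimensional top-k range reporting.
        # It only illustrates the query/update semantics; it offers none of
        # the I/O guarantees claimed in the abstract.
        import heapq

        class NaiveTopKRange:
            def __init__(self):
                self.weight = {}                 # key (1-D coordinate) -> weight

            def update(self, key, weight=None):
                # insert or overwrite a point; delete it when weight is None
                if weight is None:
                    self.weight.pop(key, None)
                else:
                    self.weight[key] = weight

            def query(self, lo, hi, k):
                # report the k heaviest points whose coordinates lie in [lo, hi]
                in_range = ((w, x) for x, w in self.weight.items() if lo <= x <= hi)
                return [x for w, x in heapq.nlargest(k, in_range)]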

    A Simple Parallel Algorithm for Natural Joins on Binary Relations


    Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space

    In entity matching classification, we are given two sets R and S of objects, where whether r and s form a match is known for each pair (r, s) in R x S. If R and S are subsets of domains D(R) and D(S) respectively, the goal is to discover a classifier function f: D(R) x D(S) -> {0, 1} from a certain class satisfying the property that, for every (r, s) in R x S, f(r, s) = 1 if and only if r and s are a match. Past research has typically run a learning algorithm directly on all the labeled (i.e., match or not) pairs in R x S. This, however, suffers from the drawback that even reading through the input incurs a quadratic cost. We pursue a direction towards removing the quadratic barrier. Denote by T the set of matching pairs in R x S. We propose to accept R, S, and T as the input, and aim to solve the problem with cost proportional to |R| + |S| + |T|, thereby achieving a large performance gain in the (typical) scenario where |T| << |R||S|. This paper provides evidence of the feasibility of the new direction by showing how to accomplish the aforementioned purpose for entity matching with linear classification, where a classifier is a multi-dimensional hyperplane separating the matching pairs from the non-matching ones. We do so in the MPC model, echoing the trend of deploying massively parallel computing systems for large-scale learning. As a by-product, we obtain new MPC algorithms for three geometric problems: linear programming, batched range counting, and dominance join.
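    To make the setup concrete, the sketch below fixes a made-up feature map phi and shows what a linear classifier over pairs looks like, together with a check that costs only |T| on the known matches; verifying the non-matches naively is exactly the quadratic barrier the paper avoids. None of the paper's MPC algorithms are reproduced here.

        # Illustration of linear classification over pairs (r, s). The feature
        # map phi is a hypothetical example; only the problem setup from the
        # abstract is mirrored, not the paper's algorithms.
        import numpy as np

        def phi(r, s):
            # hypothetical feature map: concatenate the attribute vectors of r and s
            return np.concatenate([r, s])

        def linear_classifier(w, b):
            # f(r, s) = 1 iff (r, s) falls on the positive side of the hyperplane (w, b)
            return lambda r, s: int(np.dot(w, phi(r, s)) >= b)

        def consistent_on_matches(f, T):
            # cost proportional to |T|: every known matching pair must be labeled 1.
            # (Checking the non-matches one by one would cost |R| * |S|, the
            # quadratic cost discussed in the abstract.)
            return all(f(r, s) == 1 for (r, s) in T)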

    Parallel Acyclic Joins with Canonical Edge Covers

    In PODS'21, Hu presented an algorithm in the massively parallel computation (MPC) model that processes any acyclic join with an asymptotically optimal load. In this paper, we present an alternative analysis of her algorithm. The novelty of our analysis lies in the revelation of a new mathematical structure for acyclic hypergraphs, which we name the "canonical edge cover". We prove non-trivial properties of canonical edge covers that offer a graph-theoretic perspective on why Hu's algorithm works. Comment: Accepted to ICDT'2
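    For readers unfamiliar with the underlying notion: an edge cover of a hypergraph is a subset of hyperedges whose union contains every vertex. The check below illustrates only this standard definition on a toy acyclic join; the "canonical" refinement introduced in the paper is not reproduced here.

        # Standard edge-cover check on a hypergraph whose vertices are the
        # attributes of a join and whose hyperedges are the relation schemas.
        # (The paper's canonical edge covers are a special, more structured
        # choice of cover; they are not constructed here.)

        def is_edge_cover(hyperedges, cover):
            # hyperedges: list of sets of vertices; cover: a subset of them
            vertices = set().union(*hyperedges)
            covered = set().union(*cover) if cover else set()
            return vertices <= covered

        # Toy acyclic join R(A,B) join S(B,C) join T(C,D):
        # {R, T} already covers all four attributes, so it is an edge cover.
        R, S, T = {"A", "B"}, {"B", "C"}, {"C", "D"}
        assert is_edge_cover([R, S, T], [R, T])
        assert not is_edge_cover([R, S, T], [S])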

    Distribution-Sensitive Bounds on Relative Approximations of Geometric Ranges

    A family R of ranges and a set X of points, all in R^d, together define a range space (X, R|_X), where R|_X = {X cap h | h in R}. We want to find a structure to estimate the quantity |X cap h|/|X| for any range h in R with the (rho, epsilon)-guarantee: (i) if |X cap h|/|X| > rho, the estimate must have a relative error at most epsilon; (ii) otherwise, the estimate must have an absolute error at most rho * epsilon. The objective is to minimize the size of the structure. Currently, the dominant solution is to compute a relative (rho, epsilon)-approximation, which is a subset of X with O~(lambda/(rho epsilon^2)) points, where lambda is the VC-dimension of (X, R|_X), and O~ hides polylog factors. This paper shows a more general bound sensitive to the content of X. We give a structure that stores O(log(1/rho)) integers plus O~(theta * (lambda/epsilon^2)) points of X, where theta, called the disagreement coefficient, measures how much the ranges differ from each other in their intersections with X. The value of theta is between 1 and 1/rho, such that our space bound is never worse than that of relative (rho, epsilon)-approximations, but we improve the latter's 1/rho term whenever theta = o(1/(rho log(1/rho))). We also prove that, in the worst case, summaries with the (rho, 1/2)-guarantee must consume Omega(theta) words even for d = 2 and lambda <= 3. We then constrain R to be the set of halfspaces in R^d for a constant d, and prove the existence of structures with o(1/(rho epsilon^2)) size offering (rho, epsilon)-guarantees when X is generated from various stochastic distributions. This is the first formal justification of why the term 1/rho is not compulsory for "realistic" inputs.
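    Written out in display form (with est(h) denoting the structure's estimate, a notation introduced only for this restatement), the (rho, epsilon)-guarantee from the two conditions above reads:

        \[
        \left|\, \mathrm{est}(h) - \frac{|X \cap h|}{|X|} \,\right|
        \;\le\;
        \begin{cases}
          \epsilon \cdot \dfrac{|X \cap h|}{|X|}, & \text{if } \dfrac{|X \cap h|}{|X|} > \rho,\\[1ex]
          \rho\,\epsilon, & \text{otherwise.}
        \end{cases}
        \]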

    Range Updates and Range Sum Queries on Multidimensional Points with Monoid Weights

    Let P be a set of n points in R^d where each point p in P carries a weight drawn from a commutative monoid (M, +, 0). Given a d-rectangle r_upd (i.e., an orthogonal rectangle in R^d) and a value Delta in M, a range update adds Delta to the weight of every point p in P cap r_upd; given a d-rectangle r_qry, a range sum query returns the total weight of the points in P cap r_qry. The goal is to store P in a structure to support updates and queries with attractive performance guarantees. We describe a structure of O~(n) space that handles an update in O~(T_upd) time and a query in O~(T_qry) time for arbitrary functions T_upd(n) and T_qry(n) satisfying T_upd * T_qry = n. The result holds for any fixed dimensionality d >= 2. Our query-update tradeoff is tight up to a polylog factor subject to the OMv-conjecture.
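    The brute-force sketch below (a hypothetical baseline, not the paper's structure) only makes the two operations concrete for d = 2 with integer weights under the monoid (Z, +, 0); both operations scan all n points, far from the O~(T_upd)/O~(T_qry) tradeoff stated above.

        # Naive baseline for range updates and range sum queries on 2-D points.
        # Weights live in the monoid (Z, +, 0); every operation takes O(n) time.

        def inside(p, rect):
            # rect = ((lo1, hi1), (lo2, hi2)) is an orthogonal 2-rectangle
            return all(lo <= x <= hi for x, (lo, hi) in zip(p, rect))

        class NaiveRangeSum:
            def __init__(self, points):
                self.weight = {p: 0 for p in points}   # every weight starts at the identity 0

            def range_update(self, r_upd, delta):
                # add delta to the weight of every point in P cap r_upd
                for p in self.weight:
                    if inside(p, r_upd):
                        self.weight[p] += delta

            def range_sum(self, r_qry):
                # total weight of the points in P cap r_qry
                return sum(w for p, w in self.weight.items() if inside(p, r_qry))

        # Example usage:
        ds = NaiveRangeSum([(1, 1), (2, 3), (5, 4)])
        ds.range_update(((0, 3), (0, 3)), 7)      # hits (1,1) and (2,3)
        print(ds.range_sum(((0, 10), (0, 10))))   # -> 14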

    Towards Optimal Dynamic Indexes for Approximate (and Exact) Triangle Counting
